WIP: PROCESSES proof of concept #5598
Conversation
Cc: @bartlettroscoe
Status Flag 'Pre-Test Inspection' - This Pull Request Requires Inspection... The code must be inspected by a member of the Team before Testing/Merging.
@KyleFromKitware Can you provide some context for this PR?
CC: @trilinos/kokkos
@jhux2 asked:
The context is the Kitware contract work to help address #2422, in this case for running our test suites faster and more robustly on GPUs. This would be a big deal for ATDM. NOTE: I am hoping that we can add support to TriBITS to not require any changes to CMakeLists.txt files in any Trilinos package, so this is just a proof of concept, not a PR we want to actually merge. @trilinos/kokkos, this proof-of-concept PR contains some changes to the way that Kokkos allocates GPUs. Can one of the Kokkos developers attend a meeting with @KyleFromKitware and myself to discuss these changes, what the goals are, and where to go from here?
When running the test suite on a multi-GPU machine, Kokkos tests currently only use GPU 0 while leaving all of the other GPUs untouched. This creates a major bottleneck: it limits the number of tests that can be run at once while leaving large resources idle and wasted. There is a WIP CMake branch which makes it possible to describe to CTest what hardware is available and to distribute it evenly among tests. This PR takes advantage of the new functionality to ensure that tests utilize all of the GPUs. When running with these new settings, we were able to cut the total testing time almost in half for a select set of tests. This is still a proof of concept and is not ready to be merged. We need to come up with a way to teach TriBITS how to automatically apply this setting to all Kokkos-enabled tests, thus bringing this improvement to all of Trilinos. We also need to wait for the upstream CMake branch to land and be released.
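For illustration, here is a minimal sketch of the kind of per-test annotation this enables, assuming the PROCESSES test property from the WIP CMake branch (the test name and the gpus capacity figure here are hypothetical):

SET_TESTS_PROPERTIES(SomePackage_SomeTest_MPI_2 PROPERTIES
  PROCESSES "2,gpus:512"
)

Given a machine description saying how many GPUs exist and what capacity each has, CTest can then run concurrently only as many test processes as the GPUs can hold, spreading them across all devices instead of piling everything onto GPU 0.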
Depending on the machine, you will get different behavior for how GPUs are chosen. E.g., Sierra-type platforms are different from our Power8/9 testbeds (Waterman/White/Ride). I can elaborate, but I'll need to do that later.
@KyleFromKitware All Trilinos PRs must be made against develop, not against master. I changed the base to develop. |
It seems a bit goofy that you have to add the same command-line argument to all tests. If the build enables GPUs, the default behavior should be to use the GPUs for tests. Why does Kokkos need another command-line argument? Kokkos knows if CUDA was enabled (KOKKOS_ENABLE_CUDA).
@@ -98,6 +98,7 @@
#include <Kokkos_CopyViews.hpp>
#include <functional>
#include <iosfwd>
#include <string>
Trilinos periodically gets a snapshot of Kokkos:master. Changes to Kokkos need to go into Kokkos' repository. It's OK to make changes to Trilinos' snapshot, but they also need to make it into Kokkos around the same time, so that the next Kokkos snapshot into Trilinos won't clobber your changes.
Understood. This is just a proof-of-concept, and will certainly be more fleshed out later on. I definitely don't intend for it to be merged as-is.
We will need to get changes into TriBITS upstream as well.
I'd prefer not to submit the PR to Kokkos upstream until we've fleshed out exactly what we want this to look like. Once we've done that, I will then proceed to submit PRs to all the upstream projects that I modify.
Cool, just wanted to make sure :-)
I am following the same convention as
FYI: I created kokkos/kokkos#2227 to try to see if a Kokkos developer can volunteer to meet with us to discuss the proposed Kokkos changes present in this PR.
@mhoemmen said and asked:
Not a problem to pass command-line arguments into every test that uses Kokkos. We can put an injection point in TriBITS to add whatever command-line arguments or set env vars for any test of an SE package that depends on Kokkos. This is just one proposed approach. It is up to the SNL developers and the Kitware ctest developers to decide how to do this dance. TriBITS will be the glue to make it work. CTest is running the show in terms of deciding how many tests to run, knowing how many MPI ranks each test has, etc. We just need to get Kokkos to say "yes sir" when ctest tells it what GPUs to run on. Kokkos alone does not have the context to know all of this by itself (and I don't think you want to try to make Kokkos that smart). But @KyleFromKitware, can't we just set the env var
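For illustration, the injection point described above might amount to something like the following in TriBITS (a hedged sketch: TEST_EXTRA_ARGS is an illustrative variable name, not an existing TriBITS one; --kokkos-device-ctest=gpus is the flag this PR adds):

IF (${PACKAGE_NAME}_ENABLE_Kokkos)
  # Hypothetical hook: append an extra argument to every test command
  # for packages that depend on Kokkos.
  LIST(APPEND TEST_EXTRA_ARGS "--kokkos-device-ctest=gpus")
ENDIF()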
That is one way to do it, but I didn't particularly like the idea of wrapping the whole command in another script, adding another layer of indirection. I'm also not sure what the implications are for having four different MPI ranks all in the same
Also, Nvidia doesn't recommend using
On Sierra platforms (Sierra the machine), the LSF/IBM stack (bsub/jsrun) sets CUDA_VISIBLE_DEVICES for you when you 'jsrun' a binary. The above is why I say you will get different behavior on different machines. The catch with CUDA_VISIBLE_DEVICES is that your CUDA card needs to be used by a process on the same NUMA node (effectively the same socket). You can run across NUMA domains, but I believe your process mask needs to be set wide open so you can allocate memory on the 'remote' memory (that is, the memory closest to the other socket) - this process binding + CUDA device interaction has caused a number of issues on Oak Ridge's Summit. A more robust approach may be to rely on the job submission system to handle binding + card allocation for you. On Vortex (SNL's Sierra testbed), you get some ENVs set for you; that is, the OMP variables are set based on the allocation I requested. They also set OMPI_* variables because IBM's MPI is a derivative of Open MPI.
Next, I looked at how
Next, I tried 10
TL;DR: a robust forward solution would be for CTest to support the idea of batch systems. If that interface is rich enough, then we could express different batch systems with it, e.g., SLURM, LSF, PBS. We then rely on the batch system to handle 'task' concurrency... they typically do this fairly well, since some of their customers use batch systems to spawn multiple binaries in parallel (think NOT MPI use cases). A challenge is how to use testbeds at SNL that provide a resource manager for getting nodes but may not have a batch system (e.g., White/Ride - you use bsub to get a node, but you execute mpirun directly). I think what you are doing is trying to solve the problem with direct mpirun usage, which is harder. But in doing that, perhaps seeing how IBM/SLURM do it may be a useful approach (that is, LSF is doing stuff with CUDA_VISIBLE_DEVICES and handing out CUDA device IDs based on the CPU set).
Oops, in my prior post I didn't show how that works: I effectively put jsrun inside a loop of 20 tasks. Each task randomly sleeps between 5 and 30 seconds, then prints a message. This lets jobs start/finish at various times, but all jsrun commands were effectively issued at once. This would be a change for CTest, as it would mean CTest needs to 'run' a bunch of jobs and then wait for their completion, rather than have a synchronous model of launch -> finish. Here is the test run:
@jjellio, ctest can already utilize batch systems. CTest has a way for a test to return its real runtime (and not wall-clock time); see TIMEOUT_AFTER_MATCH and some discussion on this in: (If you don't have access to that, we can provide it.) This is how we essentially use srun on 'mutrino'. Actually, we need to try: again to see if that will now work after the recent 'mutrino' update.
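For reference, TIMEOUT_AFTER_MATCH is a standard ctest test property; a hedged sketch of its use (the test name and the matched line are made up): once a test's output matches the given regular expression, the test's remaining timeout is reduced to the given number of seconds, which lets time spent waiting in a batch queue not count against the test.

SET_TESTS_PROPERTIES(SomeTest PROPERTIES
  TIMEOUT_AFTER_MATCH "20;Starting timestep loop"
)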
'srun' and 'jsrun' are not batch systems, though. They are used inside a batch system. Currently, we can define the 'run command' and some arbitrary arguments that go before 'np', but the piece that is missing and causing problems is that the 'run command' is used outside of an informed 'batch' environment. I think what could allow this to work would be to annotate the test markup so that those 'arbitrary' arguments turn into something more generic that allows you to express what an 'sbatch + srun' test should look like. E.g.:
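The example markup appears to have been lost here; a purely hypothetical sketch of the idea (none of these test properties exist in CTest) could look like:

SET_TESTS_PROPERTIES(MyTest_MPI_4 PROPERTIES
  BATCH_SYSTEM "slurm"        # hypothetical: selects slurm.cmake, lsf.cmake, pbs.cmake, ...
  BATCH_NUM_TASKS "4"         # hypothetical: becomes 'srun -n 4', 'jsrun -n 4', ...
  BATCH_TASKS_PER_NODE "2"    # hypothetical: becomes '--ntasks-per-node 2', etc.
)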
Suppose CMake then looks for a module that can provide a suitable execution: lsf.cmake can then test for and set a run command (most likely jsrun, but on our testbeds it would be mpirun). Unfortunately, running with something like 'mpirun' doesn't solve the problem of oversubscribing a node. A 'batch-aware' run tool does, and if those are used, the process binding and utilization issues become a non-problem. In the above, sbatch + srun is used, but clearly the markup would allow expression of other batch schemes. Testing then happens in a properly configured environment, and so the run command can actually be written to ensure the bindings you want.
@jjellio, you should look at: It would be great to get your input on this and see if we can address this without requiring changes to ctest. (Again, if you don't have access then Kitware can provide it to you.)
No, I do not have access.
For the purposes of experimenting with this on this branch and on 'ride' and 'waterman', can we define another env var like KOKKOS_USE_GPU_CTEST_INFO and, if set to 1, then read in these vars? That would avoid needing to pass --kokkos-device-ctest=gpus into every test, and it would allow us to test this strategy on the entire Trilinos test suite (not just a few MueLu tests).
auto local_rank_str = std::getenv("OMPI_COMM_WORLD_LOCAL_RANK"); // OpenMPI
if (!local_rank_str) local_rank_str = std::getenv("MV2_COMM_WORLD_LOCAL_RANK"); // MVAPICH2
if (!local_rank_str) local_rank_str = std::getenv("SLURM_LOCALID"); // SLURM
if (local_rank_str) {
  // (body elided in the diff: the node-local rank is used to pick which
  // of the CTest-assigned GPUs this process should use)
}

auto ctest_process_count_str = std::getenv("CTEST_PROCESS_COUNT"); // CTest
TriBITS will also need a way to set the PROCESSES property of each test to list the GPU requirements. (The value will also be different for each test, since they have different numbers of processes.) What is your recommended strategy for doing this?
Seems like they all look like:
TRIBITS_ADD_TEST( ...
NUM_MPI_PROCS <n>
PROCESSES "<n>,gpus:512"
...
)
where <n> is just the number of MPI processes/ranks, right?
What is the significance of the 512 in gpus:512 in all of these? Was that the estimate of memory (which we have no idea what it actually is)?
How do we want CTest to actually limit the number of tests that can run using the GPUs? Trying to estimate memory requirements is a complete guessing game.
Doing it by memory was just a quick example to show how it might work. I think doing it by threads would ultimately be better (no more than X threads on one GPU at a time... we will need to come up with a way to define how many threads one GPU can handle.)
As far as applying this to tests on a global level, I'm thinking we could set a variable like the following:

SET(TRIBITS_GPU_THREADS_PER_PROCESS 1)

and then TRIBITS_ADD_TEST() will set this property appropriately.
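A hedged sketch of how TRIBITS_ADD_TEST() might use that variable internally, reusing the "<n>,gpus:<capacity>" syntax from the example above (the logic and names here are illustrative only):

IF (TRIBITS_GPU_THREADS_PER_PROCESS)
  # Each of the test's MPI processes claims the configured number of GPU
  # "threads" from the capacity that CTest tracks for each GPU.
  SET_PROPERTY(TEST ${TEST_NAME} PROPERTY
    PROCESSES "${NUM_MPI_PROCS},gpus:${TRIBITS_GPU_THREADS_PER_PROCESS}")
ENDIF()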
@KyleFromKitware, I don't think you can control the number of threads run on the GPU, can you? I think it just uses as many as it wants, no?
Memory has the same problem.
Either way, we need a way to define how much capacity a GPU has, and how much of that capacity is consumed by a single process.
If that number is the same (or roughly the same) for every test-process, then we can set a global TriBITS variable. If not, then every test has to have this property individually set.
Or if the vast majority uses the same capacity, but there are a few outlier tests, then exceptions could be made for those individual tests.
@KyleFromKitware, I think we need to know more about how a GPU manages multiple processes that run kernels on it. Do the requests from different processes to run kernels on a GPU run at the same time? Do they run in serial? What about the RAM used by each process that runs things on the GPU? We need to understand all of that before we can come up with a good strategy for having ctest limit what runs on the GPUs.
I think we need another meeting with some Kokkos developers to understand these constraints with how things run on a GPU.
Agreed.
In the meantime, do we still want to proceed with modifying Kokkos for CTest awareness, or would you prefer the wrapper script strategy that we discussed yesterday?
@KyleFromKitware, I think the easiest thing to try is (a sketch follows the list):

- Add an env var to Kokkos like KOKKOS_USE_GPU_CTEST_INFO that, if set to 1, has the same impact as the current command-line argument --kokkos-device-ctest=gpus.
- Update TriBITS on this branch to set the env var KOKKOS_USE_GPU_CTEST_INFO=1 for each test (using the ctest property ENVIRONMENT).
- Update TriBITS on this branch to set the ctest property PROCESSES "<n>,gpus:512", where <n> is taken from the NUM_MPI_PROCS <n> argument, for every test. (We don't need a PROCESSES property exposed in TRIBITS_ADD_TEST().)
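As a rough sketch, steps 2 and 3 might reduce to TriBITS doing the following for each test it adds (variable names are illustrative; ENVIRONMENT is a standard ctest test property, while PROCESSES comes from the WIP CMake branch):

# Step 2: tell Kokkos to read the CTest-provided GPU info.
SET_PROPERTY(TEST ${TEST_NAME} APPEND PROPERTY
  ENVIRONMENT "KOKKOS_USE_GPU_CTEST_INFO=1")
# Step 3: declare GPU requirements for each of the test's <n> processes.
SET_PROPERTY(TEST ${TEST_NAME} PROPERTY
  PROCESSES "${NUM_MPI_PROCS},gpus:512")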
If we do that, then I think this will work, and we can then run the entire Trilinos test suite with no modifications to any Trilinos CMakeLists.txt files.
NOTE: This is not quite what we want for production, because ctest will limit running tests that may not even require a GPU, but that may not be a big issue either.
Closing in favor of #6840.